In a recent paper [7], we demonstrated the usefulness of enlarging

the  standard  set  of  decision  epochs  in  a  number  of  semi-Markov  decision 

processes  in  which  the  time  between  transitions  is  exponential.  The 

approach  employed  in  [7]  was  to  augment  the  standard  set  of  decision 

epochs  —  usually  the  times  the  system  changes  state  —  so  that  the 

exponential transition time has parameter $\Lambda$, independent of both the

current  state  and  the  action  selected.  The  motivation  and  power  of 

this  approach  stem  from  the  fact  that  the  augmented  set  of  decision 

epochs  results  in  an  n-period  problem  in  which  the  expected  horizon 

length (is $n/\Lambda$ and) does not depend upon either the control policy

employed  or  the  initial  state  of  the  system.  This  new  formulation 

does  not  typically  lead  to  the  dissipation  of  desirable  properties  — 

such  as  monotonicity  and  concavity  —  of  the  return  function  as  is 

often  true  in  the  "improper"  standard  formulation. 

Presently,  we  intend  to  make  use  of  this  technique  in  the  context 

of  truly  continuous  time  problems  with  denumerable  state  space  and 

finite  action  space,  where  by  a  continuous  time  problem  we  mean  one  in 

which  the  decision  maker  must  select  an  action  at  each  and  every  instant 

of time. Specifically, we will show that the continuous time problem $P$ can be obtained as the limit of a sequence of approximating problems $P_N$, where $P_N$ is the obvious semi-Markov version of $P$ but with exponential parameter $\Lambda_N$. Of course, we must have $\Lambda_N \to \infty$. In particular, it is shown that the return in $P_N$ of any measurable policy converges to its return in $P$.

The advantage of obtaining $P$ as the limit of the $P_N$ is that certain structural properties present in each $P_N$ persist in the limit. For example, if the optimal policy is monotone in that it uses larger actions from larger states or if the optimal return function is convex for each $N$, then these properties are preserved in passing to the limit. One could, of course, deterministically form a grid of decision points spaced $1/\Lambda_N$ time units apart rather than do this stochastically as we suggest. For the applications we envisage, however, use of a deterministic grid renders the approximating problems $P_N$ themselves difficult to solve.

Previously, Miller [11] successfully treated the undiscounted finite horizon problem, and the infinite horizon problem, with and without discounting, has been covered by Kakumanu [5] and by Miller [12]. The focus and intent of this paper, however, is not to present a theory of continuous time Markov decision processes but rather to present a method or approach for dealing with problems and models whose natural formulation results in a continuous time Markov decision process; such models often possess the kind of structure necessary to render our approach applicable.

The  requisite  notation  is  introduced  in  section  2  while  our  approach 
and  main  results  are  presented  in  sections  3  and  4  for  the  finite  and 
infinite  horizon  problems,  respectively.  The  final  section  contains 
several  applications. 


II.  NOTATION  AND  PROBLEM  DEFINITION 


We consider a continuous time Markov decision process with countable state space $S$ and action space $A = \times_{s \in S} A_s$ in which each coordinate $A_s$ is finite. For convenience, we take $S$ to be the nonnegative integers. The reward rate associated with being in state $i$ and selecting action $a$ is denoted by $r(i,a)$, and the transition rate to state $j$ from state $i$ while employing action $a$ is given by $q(j|i,a)$. Of course, $q(j|i,a) \ge 0$ for $j \ne i$ and $q(i|i,a) = -\sum_{j \ne i} q(j|i,a)$.

Following Miller [11] and Kakumanu [5], a policy $\pi$ is simply a mapping from $S \times [0,T]$ into $A$, where $T$ is the time horizon. Thus, only deterministic memoryless rules are allowed. In addition, we require that $\pi(i,t)$, the action prescribed by $\pi$ at time $t$ from state $i$, be measurable in $t$ for each $i$. Of particular interest are stationary and piecewise constant policies. If for each $i$ there is an integer $n_i < \infty$ and a sequence $0 = t_0 \le t_1 \le \dots \le t_{n_i} = T$ such that $\pi(i,t)$ is constant on $[t_j, t_{j+1})$, $j = 0, 1, \dots, n_i - 1$, then $\pi$ is said to be piecewise constant. If $\pi$ is piecewise constant and $n_i \equiv 1$, then $\pi$ is said to be stationary. As we shall see, an optimal policy can be found among the class of stationary policies if $T = \infty$ and, with an additional structural condition, among the piecewise constant policies if $T < \infty$.

While our definition of a piecewise constant policy differs from that of Miller [12], they are the same if $S$ is finite. Let $F$ be the set of maps from $S$ into $A$ so that $\pi = \{f_t\}$, $f_t \in F$. Then, according to Miller, $\pi$ is piecewise constant if there is a sequence $0 = t_0 < t_1 < \dots < t_n = T < \infty$ such that $t, t' \in (t_j, t_{j+1})$ implies $f_t = f_{t'}$, $j = 0, 1, \dots, n-1$.

Miller [11] proved that for $S$ and $T$ finite and $\alpha = 0$ there is a piecewise constant policy that is optimal. His result does not, however, remain valid if $S$ is countable, as evidenced in the following example: Take
r(i,a)  3  i^  +  a,  A^  =  {0,1},  and  q(i-l|i,l)  -  1  -  -q(i|i,l),  q(i|i,0)  =  0. 
Define $f_j \in F$ by $f_j(k) = 0$ for $k < j$ and $f_j(k) = 1$ for $k \ge j$. It is clear upon reflection that given $T < \infty$ there is an $m < \infty$ and a strictly increasing sequence $\langle t_j \rangle_0^\infty$ with $t_0 = 0$ and $t_j < T$ all $j$ such that $\pi^*$ is optimal and $\pi^*(t) = f_{j+m}$ for $t \in [t_j, t_{j+1})$, $j = 0, 1, \dots$ .

So that the expected return of each policy is finite, we need to make the following three assumptions. The first merely stipulates that, with probability 1, only a finite number of changes of state take place in a finite amount of time. Because our interest in such systems is motivated in large part by queueing reward systems, we do not, as is typical, require $\{r(s,a)\}$ to be bounded, but rather we impose two less restrictive assumptions (see [8]). Assumption 2 places a polynomial bound on $\max_a |r(s,a)|$ while Assumption 3 requires that movement to distant states carry small probability.


Assumption 1: There is a finite constant $\Lambda$ such that

(1) $\Lambda = \sup\{-q(i|i,a) : a \in A_i,\ i \in S\}$.

Assumption 2: There are finite integers $K$ and $m$ so that for each $i \ge 0$ we have

(2) $\max_{a \in A_i} |r(i,a)| \le K (i \vee 1)^m$.

Assumption 3: There is a $b < \infty$ such that for each $i$,

(3) $\max_{a \in A_i} \; -\sum_{j \ne i} (j \vee 1)^n \, q(j|i,a)/q(i|i,a) \le (i+b)^n$, $n = 1, 2, \dots, m$.

Given a policy $\pi$, denote the total expected $\alpha$-discounted reward earned on the time interval $[t,T]$ when starting at time $t$ from state $i$ by $V_{\alpha,t}(\pi,i)$ so that

$$V_{\alpha,t}(\pi) = \int_t^T e^{-\alpha(s-t)} P(t,s,\pi)\, r(\pi(s))\, ds,$$

where $P(t,s,\pi)$ is the unique transition probability matrix associated with $\pi$. Also, define the optimal return function $V_{\alpha,t}$ by

$$V_{\alpha,t}(i) = \sup_{\pi} V_{\alpha,t}(\pi,i).$$

A policy $\pi^*$ is said to be optimal if $V_{\alpha,t}(\pi^*) = V_{\alpha,t}$, all $t \le T$. When $T = \infty$, we simply drop the $t$ and write $V_\alpha$ and $V_\alpha(\pi)$. Finally, we refer to the above as problem $P_\alpha$ or simply $P$.

Associated with $P$ is a sequence $\langle P_N \rangle$ of approximating problems each of which is a semi-Markov decision process. First consider the case $T < \infty$ and set $\Lambda_N = 2^N/T$. Then by problem $N$ we mean that $2^N$-period problem with state space $S$, action space $A$, and reward function $r_{\alpha,N}$ given by $r_{\alpha,N}(s,a) = r(s,a)/(\alpha + \Lambda_N)$; the transition time is exponential with parameter $\Lambda_N$ for each $(s,a) \in S \times A$, and the law of motion $q_N$ is given by

(4) $q_N(j|i,a) = \begin{cases} q(j|i,a)/\Lambda_N, & j \ne i, \\ [\Lambda_N + q(i|i,a)]/\Lambda_N, & j = i. \end{cases}$
(Note that $q_N(i|i,a) \ge 0$ iff $\Lambda_N \ge \Lambda$. Naturally, we are only interested in $P_N$ with $\Lambda_N \ge \Lambda$, and although we will label problems $P_1, P_2, \dots$, it is understood that the first few $P_N$ are, when necessary, to be omitted.)
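
The law of motion in (4) can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a single fixed action; the function name `uniformize` and the 3-state generator `Q_example` are hypothetical and are not taken from the paper. Note that the check uses only the largest exit rate of the one generator supplied, whereas $\Lambda$ in Assumption 1 is a supremum over all states and actions.

```python
import numpy as np

def uniformize(Q, lam_N):
    """Build the one-period transition matrix of (4) for one fixed action.

    Q      : square generator matrix, Q[i, j] = q(j | i, a), rows summing to 0
    lam_N  : exponential parameter of the approximating problem
    """
    Q = np.asarray(Q, dtype=float)
    max_exit_rate = float(np.max(-np.diag(Q)))  # largest -q(i|i,a) in this generator
    if lam_N < max_exit_rate:
        # The diagonal entry (lam_N + q(i|i,a)) / lam_N would be negative.
        raise ValueError("lam_N must be at least the largest exit rate")
    # Off-diagonal: q(j|i,a)/lam_N ; diagonal: (lam_N + q(i|i,a))/lam_N
    return np.eye(Q.shape[0]) + Q / lam_N

# Hypothetical birth-death generator on states {0, 1, 2}
Q_example = np.array([[-1.0, 1.0, 0.0],
                      [2.0, -3.0, 1.0],
                      [0.0, 2.0, -2.0]])
P_example = uniformize(Q_example, lam_N=4.0)
```

Each row of `P_example` is a probability vector, which is the content of the remark that $q_N(i|i,a) \ge 0$ iff $\Lambda_N \ge \Lambda$.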

If $\pi$ is piecewise constant, then we can define the action selected by policy $\pi$ in $P_N$ when the current state is $i$ and $n$ transitions remain to be $\pi(n/\Lambda_N, i)$, and the return of policy $\pi$ in $P_N$ from state $i$ when $n$ transitions remain is denoted by $V_{\alpha,n,N}(\pi,i)$. However, if $\pi$ is not piecewise constant, this definition could lead to executing actions

in $P_N$ which only rarely are executed in $P$. For example, if $T = 1$, $S = \{0\}$, $r(0,i) = i$, and $\pi(t) = 1$ if $t$ is rational and 0 otherwise, then $V_{0,t}(\pi) = 0$ for every $t$, yet $V_{0,n,N}(\pi) = n/2^N$, since every time $n/\Lambda_N$ is rational. In view of this, we employ the following definition.

With probability $\pi_{n,N}(i,a)$, action $a$ is selected by policy $\pi$ in $P_N$ when the current state is $i$ and $n$ transitions remain, where $\pi_{n,N}(i,a)$ is the Lebesgue measure of $\{t : \pi(t,i) = a,\ t \in [(2^N - n)/\Lambda_N,\ (2^N - n + 1)/\Lambda_N)\}$. If $\pi$ is piecewise constant, then $\pi_{n,N}(i,a)$ converges to 1 [0] if $\pi(n/\Lambda_N, i) = a$ [$\ne a$] uniformly in $i$, $a$, and $n$ as $N \to \infty$, so that the two definitions lead to the same asymptotic behavior (in $n$ and $N$) of $V_{\alpha,n,N}(\pi)$. Similarly, we define the return function $V_{\alpha,n,N}$ for $P_N$ by

$$V_{\alpha,n,N}(i) = \sup_{\pi} V_{\alpha,n,N}(\pi,i), \qquad n = 0, 1, 2, \dots, 2^N.$$


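An alternative and more useful formula for $V_{\alpha,n,N}$ is ($V_{\alpha,0,N} \equiv 0$)

$$V_{\alpha,n,N}(i) = \frac{1}{\alpha + \Lambda_N} \max_{a \in A_i} \Big\{ r(i,a) + \sum_{j \ge 0} q(j|i,a)\, V_{\alpha,n-1,N}(j) \Big\} + \frac{\Lambda_N}{\alpha + \Lambda_N}\, V_{\alpha,n-1,N}(i).$$

Finally, given $0 \le t \le T$, let $\langle t_N \rangle$ be a nondecreasing sequence with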


\  - 2  vA»  * 
For the case $T = \infty$, $P_N$ is defined as above except that we set $\Lambda_N = 2^N \Lambda$, take $P_N$ to have infinitely many periods rather than $2^N$, and let $\pi_{n,N}(i,\cdot)$ be the action prescribed by $\pi$ in $P_N$ for the $n$th transition rather than when $n$ transitions remain. In this case, we write $V_{\alpha,N}(\pi)$ and $V_{\alpha,N}$ for the return of $\pi$ in $P_N$ and the return function itself.
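
The finite horizon recursion for the return function of $P_N$ (starting from zero terminal value, with per-period reward $r/(\alpha + \Lambda_N)$ and the law of motion (4)) lends itself to a short backward induction. The sketch below is an illustration only: it assumes NumPy, a finite state space, and the same action set in every state, and the name `finite_horizon_values` and its array layout are hypothetical.

```python
import numpy as np

def finite_horizon_values(r, Q, alpha, lam_N, periods):
    """Backward induction for the approximating problem P_N.

    r[a][i]    : reward rate in state i under action a, shape (A, S)
    Q[a][i][j] : transition rate q(j | i, a), shape (A, S, S)
    Returns (values, actions): values[n] is the return function with n
    periods remaining; actions[n-1] holds a maximizing action per state.
    """
    r = np.asarray(r, dtype=float)
    Q = np.asarray(Q, dtype=float)
    V = np.zeros(r.shape[1])              # zero return with no periods left
    values, actions = [V.copy()], []
    for _ in range(periods):
        gain = r + Q @ V                  # r(i,a) + sum_j q(j|i,a) V(j), shape (A, S)
        actions.append(gain.argmax(axis=0))
        V = (gain.max(axis=0) + lam_N * V) / (alpha + lam_N)
        values.append(V.copy())
    return np.array(values), np.array(actions)

# One state, one action, reward rate 1, T = 1, N = 3 (so lam_N = 8, 2**3 periods)
T, N = 1.0, 3
values, actions = finite_horizon_values(
    r=[[1.0]], Q=[[[0.0]]], alpha=0.0, lam_N=2**N / T, periods=2**N)
```

In this one-state case each period contributes $1/\Lambda_N$, so the return with $n$ periods remaining is $n/2^N$ and the full-horizon return equals $T = 1$.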
III.  FINITE  HORIZON  RESULTS

We begin by demonstrating that $V_{\alpha,t_N,N}(\pi)$ converges to $V_{\alpha,t}(\pi)$ for every $\pi$ and that $V_{\alpha,t_N,N}$ converges to $V_{\alpha,t}$. Next, we make use of $\langle P_N \rangle$ to construct a policy $\pi^*$ defined on the dyadic rationals in $[0,T]$ with the property that $V_{\alpha,t_N,N}(\pi^*) - V_{\alpha,t_N,N}$ converges (on some subsequence) to 0. Roughly speaking, $\pi^*$ is optimal on a countable dense set of $[0,T]$. By imposing an additional structural condition on $\langle P_N \rangle$ (and hence on $P$), $\pi^*$ has an (essentially) unique extension and is optimal for $P$. In addition, we show that the optimal return function uniquely satisfies the appropriate functional equation.

LEMMA 1: If $\pi$ is piecewise constant, then

$$V_{\alpha,t_N,N}(\pi) \to V_{\alpha,t}(\pi).$$


Proof: Throughout the proof, the initial state $i$ is fixed, and we employ the special definition for piecewise constant policies in $P_N$. Let $\varepsilon > 0$ be given. To begin, define $B_J$ to be the event that during $[0,T]$ the set $\{0, 1, \dots, i+J\}$ of states is left by the process induced by $\pi$, either in $P$ or in some $P_N$. Now Assumptions 1-3 ensure (see [8]) that $C < \infty$, where

$$C = \max\{1;\ V_{\alpha,t}(\pi,i);\ \sup_N V_{\alpha,t_N,N}(\pi,i)\},$$

so we can fix $J$ so that $P(B_J)\,C < \varepsilon$.

Let $S_N(\omega) \subset [t,\infty)$¹ be the set of times during which the actions specified by $\pi$ and those induced by $\pi$ in $P_N$ are different. We claim that for any $\gamma > 0$

(5) $P(L_N < \gamma \mid \bar{B}_J) \to 1$ as $N \to \infty$,

where $L_N(\omega)$ is simply the Lebesgue measure of $S_N(\omega)$. (As $S_N(\omega)$ is, with probability 1, merely a finite union of intervals, $L_N$ exists and is itself $P$-measurable.)

¹ While the expected duration of problem $N$ is $T$, the realized duration may, in fact, be much larger than $T$.

To begin, let $t = t_0 < t_1 < \dots < t_M = T$ be an enumeration of the set of times when the action specified by $\pi$ from any state $j \le i+J$ changes. Define $k(N,j)$ to be the smallest nonnegative integer such that $t + k(N,j)/\Lambda_N \ge t_j$. Take $N$ sufficiently large so that

$$D_N = \sum_{j=0}^{M} \big(t + k(N,j)/\Lambda_N - t_j\big) < \gamma/2.$$

(The quantity $D_N$ represents an upper bound on $L_N$ if each period in $P_N$ had length precisely $1/\Lambda_N$, its mean.)

Let $\tau_{N,j}$ be the sum of $k(N,j) - k(N,j-1)$ independent exponential random variables each having parameter $\Lambda_N$, and assume that the components of $\{\tau_{N,j}\}$ are also independent. Next, define $T_{N,j} = \tau_{N,1} + \dots + \tau_{N,j}$. The claim is established upon verifying that

$$P\Big(\sum_{j=1}^{M} |T_{N,j} - E(T_{N,j})| < \gamma/2\Big) \to 1 \quad \text{as } N \to \infty,$$

which follows, using the Chebychev Inequality, by noting that

$$P\Big(|\tau_{N,j} - E(\tau_{N,j})| < \frac{\gamma}{2M^2}\Big) \ge 1 - \frac{4M^4}{\gamma^2}\,\mathrm{var}(\tau_{N,j}) = 1 - \frac{4M^4}{\gamma^2}\,\frac{k(N,j) - k(N,j-1)}{\Lambda_N^2} = 1 - \frac{4M^4 T^2}{\gamma^2}\,\frac{k(N,j) - k(N,j-1)}{2^{2N}} \ge 1 - \frac{4M^4 T^2}{\gamma^2}\cdot\frac{1}{2^N}.$$
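
To get a feel for how quickly the Chebychev bound above approaches 1, one can evaluate its final expression $1 - 4M^4T^2/(\gamma^2 2^N)$ directly. The helper names below are hypothetical, and the sketch covers only the single-term bound, not the combination over $j = 1, \dots, M$.

```python
def chebyshev_lower_bound(M, T, gamma, N):
    """Lower bound 1 - 4 M^4 T^2 / (gamma^2 2^N) on the single-term probability."""
    return 1.0 - 4.0 * M**4 * T**2 / (gamma**2 * 2**N)

def smallest_N(M, T, gamma, target):
    """Smallest N for which the bound reaches `target` (requires target < 1)."""
    if target >= 1.0:
        raise ValueError("target must be strictly less than 1")
    N = 0
    while chebyshev_lower_bound(M, T, gamma, N) < target:
        N += 1
    return N

# Hypothetical values: one switching time, unit horizon, tolerance gamma = 1
bound_at_4 = chebyshev_lower_bound(M=1, T=1.0, gamma=1.0, N=4)
needed_N = smallest_N(M=1, T=1.0, gamma=1.0, target=0.75)
```

Because the bound degrades like $M^4$, policies with many switching times require a considerably finer grid before the bound becomes informative.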


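Let $\mathcal{S}_N$ be that subset of the sample space such that for all $s$ in $[t,T]$ we have $X_s = X_{N,s}$, where $X_s$ and $X_{N,s}$ are the states of the process at time $s$ in $P$ and $P_N$, respectively.

Since the maximum rate at which the state can change is $\Lambda < \infty$, $P(\mathcal{S}_N \mid L_N < \gamma;\ \bar{B}_J) \ge e^{-\Lambda\gamma}$. This fact coupled with (5) enables us to conclude that

(6) $\limsup_{N \to \infty} P\big(\bar{B}_J \cap \overline{(\mathcal{S}_N \cap \{L_N < \gamma\})}\big) \le P(\bar{B}_J)\,(1 - e^{-\Lambda\gamma})$.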


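The absolute difference in cost in $P$ and $P_N$ for any $\omega$ in $\{L_N < \gamma\} \cap \mathcal{S}_N \cap \bar{B}_J$ is at most $\gamma \max\{|r(n,a)| : n \le i+J,\ a \in A_n\} \equiv \gamma R$. Similarly, this difference for $\omega \in \bar{B}_J \cap \overline{(\mathcal{S}_N \cap \{L_N < \gamma\})}$ is at most $RT$. Now choose $\gamma$ so that $\gamma R < \varepsilon$ and $RT(1 - e^{-\Lambda\gamma}) < \varepsilon$. Consequently, for $N$ sufficiently large,

$$|V_{\alpha,t_N,N}(\pi,i) - V_{\alpha,t}(\pi,i)| \le P(B_J)\,C + \gamma R + RT(1 - e^{-\Lambda\gamma}) < 3\varepsilon.$$

Q.E.D.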


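THEOREM 2. For any policy $\pi$,

$$V_{\alpha,t_N,N}(\pi) \to V_{\alpha,t}(\pi).$$

Proof: Since $\pi$ is measurable, there is a piecewise constant policy $\bar{\pi}$ with the property that for each $i$ the measure of the subset of $[0,T]$ such that $\pi(i,t) \ne \bar{\pi}(i,t)$ is less than $\varepsilon/2^i$. Noting that

$$|V_{\alpha,t_N,N}(\pi) - V_{\alpha,t}(\pi)| \le |V_{\alpha,t_N,N}(\pi) - V_{\alpha,t_N,N}(\bar{\pi})| + |V_{\alpha,t_N,N}(\bar{\pi}) - V_{\alpha,t}(\bar{\pi})| + |V_{\alpha,t}(\bar{\pi}) - V_{\alpha,t}(\pi)|,$$

we need only verify that each of these three terms is small for $N$ large. Lemma 1 states that the middle term converges to zero. The third term is small because the probability that different sample paths will result is at most $1 - e^{-2\Lambda\varepsilon} \le 2\Lambda\varepsilon$, so that the arguments of the proof of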

Lemma 1 suffice. To see that the first term is small, merely observe that $1 - e^{-4\Lambda\varepsilon}$ is, for $N$ sufficiently large, an upper bound on the probability that the sample paths are not exactly the same. When the sample paths are the same, the actions selected are the same except perhaps for an effective time interval of measure $4\varepsilon$. Thus, the arguments of the proof of Lemma 1 again suffice.

Q.E.D.

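LEMMA 3. For each $t$, $V_{\alpha,t_N,N} \to V_{\alpha,t}$.

Proof: From Theorem 2, we can choose $\pi$ with $V_{\alpha,t}(\pi) \ge V_{\alpha,t} - \varepsilon$ so that

$$V_{\alpha,t_N,N}(i) \ge V_{\alpha,t_N,N}(\pi,i) \to V_{\alpha,t}(\pi,i).$$

Thus, $\liminf_{N \to \infty} V_{\alpha,t_N,N} \ge V_{\alpha,t}$.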

To  see  that  lim^sup  <  v^.  observe  that  ^  H<«N) 

and  we  can,  without  loss  of  generality,  take  it  to  be  constant  on  the 
2N  intervals 


J  -  °'1 . 2"-1-  S1"“  V«,tN,S<'N>  ’  Vt^'V  i 


V°  t*ie  continuity  °f  va  s  s  yields 


lim  sup  V  <  lim  sup  V  .  ■  V  . 

N-+®  a' VN  ~  H—  a'VN  a'fc 


Q.E.D.

We now show how to construct a policy $\pi^*$ that is optimal for problem $P$. To begin, for each $i$ and each point in the set $D$ of dyadic rationals in $[0,T]$, we define $\pi^*$ so that actions are selected in accord with an optimal policy for infinitely many $P_N$. Assuming the presence of a certain structural property given below, we extend the definition of $\pi^*$ from the dense set $D$ to all of $[0,T]$ and then verify that $\pi^*$ is optimal for $P$.

Define $D_N = \{jT/2^N : j = 0, 1, \dots, 2^N\}$, note that $D_N \uparrow D$, and for each $t \in D$, say $t = jT/2^M$, let $A_{i,t}$ be that set of actions which, for infinitely many $N$, is optimal from state $i$ in $P_N$ when $j2^{N-M}$ transitions remain. By diagonalization and the fact that each $A_i$ is finite, we can find a subsequence $\langle n_N \rangle$ with the following property. For each $i < N$ and each $t \in D_N$ (say $t = jT/2^M$), there is a subset $A'_{i,t}$ of $A_{i,t}$ such that $a \in A'_{i,t}$ means that $a$ is optimal in $P_{n_k}$ when $j2^{n_k - M}$ periods remain for all $k > N$; $N = 1, 2, \dots$ .

We say that $\langle P_N \rangle$ is connected if for each state $i$ and each $N$ the optimality of action $d$ when $n$ and $m$ periods remain with $n < m$ implies that $d$ is also optimal when $n+1, n+2, \dots,$ and $m-1$ periods remain.

Order the elements of each $A_i$ and set $\pi^*(i,t)$ to be the minimal element of $A'_{i,t}$ for $t \in D$ and $i \in S$. Assuming that $\langle P_N \rangle$ is connected, we can define the policy $\pi^*$ to be the unique left-continuous extension. It is worth noting that $\pi^*$ inherits all structural properties (such as using faster rates from higher states) present in $\langle P_N \rangle$, including the property of connectedness. We will make extensive use of this in the applications.

THEOREM 4. If $\langle P_N \rangle$ is connected, then the policy $\pi^*$ is optimal for $P$. In particular, $\pi^*$ is piecewise constant and for each $i$ has at most card $A_i$ switches.

Proof: Let $i$ be the initial state. Since $\langle P_N \rangle$ and $\pi^*$ are connected, the relative frequency of periods in $P_{n_N}$ when $\pi^*$ is not using an optimal action from states $\{0, 1, \dots, i+J\}$ is, for each sample path, at most $f_N = \sum_{k=0}^{i+J} \mathrm{card}\, A_k / 2^N$, where $J < N$ is chosen so that the probability of reaching a state larger than $i+J$ when using either the optimal policy or $\pi^*$ in $P_{n_N}$ is less than $\varepsilon$ for all $N$. As $f_N \to 0$, the idea of the proof of Lemma 1 suffices to yield $V_{\alpha,t_{n_N},n_N}(\pi^*,i) - V_{\alpha,t_{n_N},n_N}(i) \to 0$, which, coupled with Theorem 2 and Lemma 3, establishes the optimality of $\pi^*$.

Q.E.D.

REMARK. If instead of $\langle P_N \rangle$ connected we assume: for each $i$

$B_i = \sup_N$ {# of periods in which the optimal action differs from preceding period's action for $P_N$ and state $i$} $< \infty$,

then the proof works with $B_i$ replacing card $A_i$ and $\pi^*$ is piecewise constant but not necessarily connected.

Restricting attention to the case $\alpha = 0$ and $S$ finite, Miller [11] provided the following necessary and sufficient condition for optimality. The method of proof used here is Miller's.

THEOREM 5. A necessary and sufficient condition for a policy $\pi$ to be optimal is that for almost all $t \in [0,T]$,

(7) $r(f) + Q(f)\psi(t)$

is maximized over the set $F$ of maps from $S$ to $A$ by $\pi(t)$, where $q(j|i,f(i))$ is the $ij$th element of $Q(f)$ and the column vector $\psi(t)$ is the unique absolutely continuous solution to

(8) $-\dot{\psi}(t) = r(\pi(t)) + Q(\pi(t))\psi(t) - \alpha\psi(t), \qquad \psi(T) = 0$.

Moreover, the solution to Equation 8 satisfies

(9) $\psi(t) = \int_t^T e^{-\alpha(s-t)} P(t,s)\, r(\pi(s))\, ds = V_{\alpha,t}(\pi)$.


Proof: The proof is nearly that of Miller [11], as the case $\alpha > 0$ is straightforward (see [9] for details). To extend the proof to the case $S$ countable, it suffices to demonstrate the uniqueness and existence of a solution to Equation 8. To begin, write $r(s)$ and $Q(s)$ for $r(\pi(s))$ and $Q(\pi(s))$, define the Banach space $B$ by

$$B = \{\langle u(i) \rangle_{i \in S} : \sup_i |u(i)|/(i \vee 1)^m < \infty\},$$

and let $M$ be the (metric) space of continuous functions on $[0,T]$ into $B$ with metric $d(x,y) = \max_{0 \le s \le T} \|x(s) - y(s)\|$, where $\|\cdot\|$ is the norm in $B$. Next, define the map $F: M \to M$ by

$$[Fx(t)]_i = \psi_i(T) + \int_t^T \{r_i(s) + (Q(s)x(s))_i - \alpha x_i(s)\}\, ds.$$

It will suffice to show that $F$ is well defined and has a unique fixed point. Noting that

$$(Q(s)x(s))_i = \sum_{j=0}^{\infty} q(j|i,\pi(s,i))\, x_j(s) = \lim_{n \to \infty} \sum_{j=0}^{n} q(j|i,\pi(s,i))\, x_j(s),$$

we see that $(Q(s)x(s))_i$ is measurable since $x_j(s)$ is continuous, $q(j|i,\pi(s,i))$ is measurable, and the limit of measurable functions is measurable (the sum converges by Assumption 3). Hence, $F$ is well defined.

From Assumption 3, it follows that there is a uniform bound on $\|Q(f)\|$ over $F$, say $D$, so that $\|Q(s)\| \le D$, $0 \le s \le T$. Consequently, $d(F(x),F(y)) \le TD\, d(x,y)$, so that $F$ is a contraction and has a unique fixed point if $TD < 1$. If $TD$ is not less than one, then merely choose a time $T'$ for which $T'D < 1$ and piece together the desired solution by appropriate choice of the initial value $\psi(T)$.

Q.E.D.


In view of Equation 9 and Theorem 5, we are led to inquire whether the optimal return function $V_\alpha$ satisfies

(10) $-\dot{\psi}(t) = \max_{f \in F} \{r(f) + Q(f)\psi(t)\} - \alpha\psi(t), \qquad \psi(T) = 0$.

COROLLARY 6. There is an optimal policy, and the optimal return function is the unique solution (in $M$) to (10).

Proof: Let $B$ and $M$ be given as in the proof of Theorem 5 and define $F: M \to M$ by

(11) $[Fx(t)]_i = \int_t^T \max_{a \in A_i} \Big\{ r(i,a) + \sum_{j=0}^{\infty} q(j|i,a)\, x_j(s) - \alpha x_i(s) \Big\}\, ds$.

Fix $x \in M$, $i$, and $a \in A_i$. We claim that the maximand is continuous. To see this, choose $t \in [0,T]$, let $\varepsilon > 0$ be given and choose $\delta > 0$ so that $\|x(s) - x(t)\| < \varepsilon/(i+b)^m$ whenever $|s-t| < \delta$. Then by Assumption 3 we have

$$\Big|\sum_j q(j|i,a)\, x_j(s) - \sum_j q(j|i,a)\, x_j(t)\Big| \le \sum_j q(j|i,a)\, |x_j(s) - x_j(t)| \le \sum_j q(j|i,a)\, \varepsilon\, (j \vee 1)^m/(i+b)^m \le \varepsilon,$$

justifying our claim. Consequently, for each $i$, the action attaining the maximum can be chosen measurably so that $F$ is well defined. Finally, $F$ has a unique fixed point which, by Theorem 5, is $V_\alpha$.

Q.E.D.
IV.  INFINITE  HORIZON  RESULTS

The infinite horizon problem is considerably more straightforward than the case $T < \infty$. In fact, the existence of an optimal stationary policy follows from three simple observations. First, for any stationary policy $f \in F$ the return of $f$ is the same in $P$ and $\langle P_N \rangle$; that is

(12) $V_\alpha(f) = V_{\alpha,N}(f)$, all $N$.

Second, by known results concerning semi-Markov decision processes (see Theorem 1 of [8]), there is an $f_\alpha \in F$ such that

(13) $V_{\alpha,N}(f_\alpha) = \sup_{\pi} V_{\alpha,N}(\pi)$, all $N$,

where the sup is taken over all policies, including randomized and history dependent policies. Third, for each $\varepsilon > 0$ and initial state $i$, we can (in view of Assumptions 1, 2, 3) find a time $T_{\varepsilon,i} < \infty$ such that the total expected $\alpha$-discounted reward received after time $T_{\varepsilon,i}$ when starting from state $i$ is less than $\varepsilon$. Coupling this with Theorem 2 yields

(14) $V_{\alpha,N}(\pi) \to V_\alpha(\pi)$, for each policy $\pi$.

THEOREM 7. For each $\alpha > 0$, there is an $f_\alpha \in F$ such that $V_\alpha(f_\alpha) = V_\alpha$.


Let $V(\pi,T,i)$ denote the total expected reward earned by time $T$ when employing policy $\pi$ and starting from state $i$, and define $V(\pi)$, the average expected return per unit time of policy $\pi$, by

$$V(\pi) = \liminf_{T \to \infty} V(\pi,T)/T.$$

Then if there is an $f^* \in F$ and a sequence $\alpha_n \downarrow 0$ with $V_{\alpha_n}(f^*) = V_{\alpha_n}$, we have, employing a standard Abelian result (see Lemma 1 of [6]),

$$V(\pi) = \liminf_{T \to \infty} V(\pi,T)/T \le \liminf_{\alpha \to 0+} \alpha V_\alpha(\pi) \le \liminf_{n \to \infty} \alpha_n V_{\alpha_n}(\pi) \le \liminf_{n \to \infty} \alpha_n V_{\alpha_n}(f^*) = V(f^*).$$

The existence of such an $f^*$ is ensured if $S$ is finite. More important, however, is the fact that the existence of such an $f^*$ can often be verified in countable state problems arising in the context of specific applications (see Example 2).

V.  APPLICATIONS

The following three examples taken from the queueing optimization literature illustrate the use of our approach.

EXAMPLE 1. Optimal Customer Selection in an M/M/c Queue (Miller-Cramer-Lippman)

We consider the problem of determining which customers to admit into the system so as to maximize the expected $\alpha$-discounted reward earned over a finite horizon of length $T$ in an M/M/c queue with finite queue capacity $Q$. Each customer class, $1 \le k \le K < \infty$, is distinguished only by the reward $r_k$ associated with acceptance of a class $k$ customer into the queue and the Poisson arrival rate $\lambda_k$. Each of the $c$ exponential servers has rate $\mu$. For convenience, label the customer classes so that

$$0 < r_1 < r_2 < \dots < r_K.$$

Even though the rewards are received in lump sums and not as rates, the following clever problem representation due to Miller [13] permits formulation as a continuous time Markov decision process. Take $S = \{0, 1, 2, \dots, c+Q\}$ so that being in state $i$ means that there are $i$ customers in the system. For $i < c+Q$, take $A_i$ to be the power set of $\{1, 2, \dots, K\}$ so that each action $a \in A_i$, a subset of $\{1, 2, \dots, K\}$, merely stipulates which customer classes are to be admitted ($A_{c+Q} = \emptyset$). Then the reward and transition rates are given by $r(a,i) = \sum_{j \in a} \lambda_j r_j$, $q(i+1|i,a) = \sum_{j \in a} \lambda_j$, $q(i-1|i,a) = (i \wedge c)\mu$, and $q(j|i,a) = 0$, $j \ne i-1, i, i+1$.

By considering the appropriate semi-Markov version of this problem (with any rate $\Lambda \ge c\mu + \sum_{j=1}^{K} \lambda_j$), it can be shown (see Theorems 4, 5, 6 of [7]) that the minimal reward accepted when the discount factor is $\alpha$, there are $i$ customers in the system, and $n$ transitions remain is an increasing function of $1/\alpha$, $i$, and $n$. Consequently, the approximating problems are connected so we can conclude from Theorem 4 that associated with each $\alpha \ge 0$ there is an optimal policy $\pi_\alpha$ with the following property: the set $\pi_\alpha(i,t)$ of customer classes accepted from state $i$ at time $t$ is nonincreasing in $T-t$, $i$, and $1/\alpha$; moreover, $\pi_\alpha(i,t)$ is of the form $\{j, j+1, \dots, K\}$. (This result was obtained by Miller [Theorem 7.3, 10] for the case $Q = \alpha = 0$.)
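
The reward and transition rates of Miller's representation in Example 1 can be written down directly. The sketch below is illustrative only: the function name `make_admission_model`, the dictionary-based inputs, and the numerical values are assumptions made here, and the full state (state $c+Q$) is handled by allowing only the empty admission set.

```python
def make_admission_model(arrival_rates, rewards, mu, c, Q):
    """Return (reward_rate, rate) for the M/M/c admission model of Example 1.

    arrival_rates, rewards : dicts keyed by customer class 1..K
    mu, c, Q               : service rate per server, servers, queue capacity
    An action `a` is a set of customer classes to admit.
    """
    top = c + Q  # largest state; no admissions are possible there

    def reward_rate(i, a):
        if i == top and a:
            raise ValueError("no class can be admitted when the system is full")
        # Lump-sum rewards expressed as a rate: sum of lambda_j * r_j over admitted classes
        return sum(arrival_rates[k] * rewards[k] for k in a)

    def rate(j, i, a):
        up = sum(arrival_rates[k] for k in a) if i < top else 0.0
        down = min(i, c) * mu                  # (i ^ c) * mu
        if j == i + 1:
            return up
        if j == i - 1:
            return down
        if j == i:
            return -(up + down)                # rows of the generator sum to zero
        return 0.0

    return reward_rate, rate

# Hypothetical instance: two classes, two servers, one waiting space
reward_rate, rate = make_admission_model(
    arrival_rates={1: 1.0, 2: 2.0}, rewards={1: 3.0, 2: 5.0}, mu=1.5, c=2, Q=1)
```

With these rates, any $\Lambda \ge c\mu + \sum_j \lambda_j$ (here $3 + 3 = 6$) bounds every exit rate, which is the condition quoted for the semi-Markov version.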

EXAMPLE 2. An M/M/1 Queue with Variable Service Rate (Crabill-Sabeti)

The problem posed by Crabill [2] was to determine the service rate $\mu \in \{\mu_j\}_{j=1}^{K}$ to employ so as to minimize the expected average cost per unit time in an M/M/1 infinite capacity queue with arrival rate $\lambda$ in the presence of a holding cost $h$ per customer per unit time and a service cost rate $c_j$ associated with the service rate $\mu_j$. For convenience, label $0 < \mu_1 < \mu_2 < \dots < \mu_K$ and assume, as is reasonable, that $c_1 < c_2 < \dots < c_K$. Our interest is in minimizing the expected $\alpha$-discounted cost incurred during a horizon of length $T \le +\infty$.

Here, the natural formulation (the state is the number of customers in the system and the action is the rate to employ) suffices. Utilizing results from [7], we can, as in Example 1, conclude that for $T < \infty$ and each $\alpha \ge 0$ there is an optimal policy $\pi_\alpha$ such that $\pi_\alpha(i,t)$, the optimal rate from state $i$ when $t$ time units remain, is nondecreasing in $i$, $t$, and $1/\alpha$. Furthermore, Theorem 7 enables us to assert that the optimal rate to employ when $T = \infty$ is a nondecreasing function of both $1/\alpha$ and $i$. (Note that with the natural formulation in the infinite horizon version of Example 1 there are no actions available unless an arrival occurs, whereas actions are continuously available in this variable service rate model.)

A reward for service completions can also be included (see [7, p. 36]).

EXAMPLE 3. Optimal Admission to an M/M/1 Queue (Emmons)

We consider the single server version of Emmons' [3] M/M/c infinite capacity queue with arrival rate $\lambda$ and service rate $\mu$. The problem is to dynamically determine whether or not to admit arriving customers into the system, counterbalancing the reward $r$ received for admitting a customer against the possible overtime service cost which is incurred beginning at time $T$ and continuing until the system is empty. Customers cannot be admitted after time $T$. Our goal is to maximize the expected $\alpha$-discounted net profit.

In order to represent the problem as a continuous time Markov decision process, Emmons introduced a non-zero terminal condition to incorporate the overtime cost and utilized Miller's method to handle the rewards associated with customer acceptance. Instead of assuming that the overtime service cost $c(t)$ associated with closing $t$ units late is linear, we assume that $S(i)$, the expected $\alpha$-discounted value of overtime service costs associated with $i$ customers present at time $T$, is convex in $i$.

For example, $c(t) = x e^{\beta t}$ with $\beta \ge \alpha$ yields the desired convexity.

In order to verify that the approximating problems are connected,
we need the following two results. The first, Lemma 8, states that there
is a sequence $\langle i_n \rangle$ of critical numbers with the property
that it is optimal to accept a customer when $n$ transitions remain and
$i$ customers are currently in the system iff $i < i_n$. The second result,
Lemma 9, states that $i_0 \le i_1 \le i_2 \le \dots$. In view of Lemmas 8
and 9, we can employ Theorem 4 as in Examples 1 and 2 to verify that there
is an optimal policy for $P$ characterized by a sequence
$T \ge t_0 \ge t_1 \ge \dots \ge 0$ of critical numbers as follows: a
customer seeking admission at time $t$ when there are $i$ customers in the
system will be admitted iff $t < t_i$.

(It is likely that $t_k = 0$ for some $k < \infty$ and $t_0 = T$ iff $r < S(1)$.)

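Fix $\Lambda \ge \lambda + \mu$ and $\alpha > 0$. Then $V_n(i)$, the
$n$-period return starting from state $i$, satisfies (with $V_0(i) = -S(i)$)

$$V_{n+1}(i) = \frac{1}{\alpha+\Lambda} \max \{ \lambda r + \lambda V_n(i+1) + \mu V_n(i-1) + (\Lambda-\lambda-\mu) V_n(i) ;\ \mu V_n(i-1) + (\Lambda-\mu) V_n(i) \},$$

or

$$V_{n+1}(i) = \frac{1}{\alpha+\Lambda} \{ \lambda \max[\, r + v_n(i+1) ;\ 0 \,] + \mu V_n(i-1) + (\Lambda-\mu) V_n(i) \},$$

where $v_n(i+1) = V_n(i+1) - V_n(i)$.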
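
The passage between the two forms of the recursion is a one-line rearrangement. As a sketch, writing $v_n(i+1) = V_n(i+1) - V_n(i)$, the first argument of the maximum (admit the arriving customer) can be regrouped as follows:

```latex
\begin{align*}
\lambda r + \lambda V_n(i+1) + \mu V_n(i-1) + (\Lambda-\lambda-\mu) V_n(i)
  &= \lambda \bigl[ r + V_n(i+1) - V_n(i) \bigr] + \mu V_n(i-1) + (\Lambda-\mu) V_n(i) \\
  &= \lambda \bigl[ r + v_n(i+1) \bigr] + \mu V_n(i-1) + (\Lambda-\mu) V_n(i).
\end{align*}
```

The second argument (reject) is the same expression with the bracketed term replaced by 0. The two arguments therefore differ only in that term, and the maximum equals $\lambda \max[\, r + v_n(i+1) ;\ 0 \,] + \mu V_n(i-1) + (\Lambda-\mu) V_n(i)$.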

LEMMA 8. For each $n$, $V_n(\cdot)$ is concave; that is, $v_n(i) \ge v_n(i+1)$, all $i$.

LEMMA 9. For each $n \ge 0$ and each $i \ge 1$, $v_{n+1}(i) \ge v_n(i)$.

The proofs of Lemmas 8 and 9 are given in [9].
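
The recursion for $V_n$ and the critical numbers of Lemma 8 can be explored numerically. The following is a minimal sketch, not the author's computation. It makes several assumptions: the quadratic `terminal_cost` is a hypothetical convex stand-in for $S(i)$, all function and parameter names are illustrative, and a service completion in the empty system is treated as a null transition (so $V_n(-1)$ is read as $V_n(0)$), a boundary convention the text does not state.

```python
def terminal_cost(i, s=0.5):
    """Hypothetical convex overtime cost S(i); a placeholder, not from the paper."""
    return s * i * i


def value_iteration(n_periods, n_states, lam, mu, r, alpha, big_lam, S=terminal_cost):
    """Return [V_0, ..., V_{n_periods}], each a list over states 0..n_states.

    Assumption: in the empty system a service completion is a null
    transition, i.e. V_n(-1) is read as V_n(0).
    """
    if big_lam < lam + mu:
        raise ValueError("need Lambda >= lambda + mu")
    # V_{n+1}(i) needs V_n(i+1), so start with n_periods extra states and
    # drop one per step; states 0..n_states then carry no truncation error.
    width = n_states + n_periods + 1
    V = [-S(i) for i in range(width)]
    history = [V[: n_states + 1]]
    for _ in range(n_periods):
        nxt = []
        for i in range(len(V) - 1):
            v_up = V[i + 1] - V[i]               # v_n(i+1)
            below = V[i - 1] if i > 0 else V[0]  # boundary assumption at i = 0
            nxt.append((lam * max(r + v_up, 0.0) + mu * below
                        + (big_lam - mu) * V[i]) / (alpha + big_lam))
        V = nxt
        history.append(V[: n_states + 1])
    return history


def critical_number(V, r):
    """Smallest i with r + v(i+1) < 0; admitting is optimal iff i is below it."""
    for i in range(len(V) - 1):
        if r + (V[i + 1] - V[i]) < 0:
            return i
    return len(V) - 1  # no rejecting state within the stored range


def is_concave(V, tol=1e-9):
    """Check v(i) >= v(i+1) for all interior i, up to a rounding tolerance."""
    return all(V[i] - V[i - 1] >= V[i + 1] - V[i] - tol
               for i in range(1, len(V) - 1))


if __name__ == "__main__":
    hist = value_iteration(n_periods=20, n_states=30, lam=1.0, mu=1.5,
                           r=5.0, alpha=0.1, big_lam=3.0)
    print("critical numbers:", [critical_number(V, 5.0) for V in hist])
    print("all concave:", all(is_concave(V) for V in hist))
```

The concavity check corresponds to Lemma 8. The ordering of the critical numbers in Lemma 9 rests on the structure of $S$ established in [9], so it is not asserted for the placeholder cost used here.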


... 




REFERENCES 


1. Cramer, M., "Optimal Customer Selection in Exponential Queues,"
ORC 71-24, Operations Research Center, University of California,
Berkeley, 1971.

2. Crabill, T.B., "Optimal Control of a Queue with Variable Service
Rates," Ph.D. Dissertation, Cornell University, 1968.

3. Emmons, H., "The Optimal Admission Policy to a Multiserver Queue
with Finite Horizon," J. Appl. Prob., 9, 103-116 (1972).

4. Harrison, J.M., "Discrete Dynamic Programming with Unbounded Rewards,"
Ann. Math. Stat., 43, 636-644 (1972).

5. Kakumanu, P., "Continuously Discounted Markov Decision Model with
Countable State and Action Space," Ann. Math. Stat., 42, 919-926
(1971).

6. Lippman, S.A., "Semi-Markov Decision Processes with Unbounded Rewards,"
Management Science, 19, 717-731 (1973).

7. Lippman, S.A., "A New Technique in the Optimization of Exponential
Queueing Systems," Working Paper No. 211, Western Management Science
Institute, UCLA, October 1973.

8. Lippman, S.A., "On Dynamic Programming with Unbounded Rewards,"
Working Paper No. 212, Western Management Science Institute, UCLA,
November 1973.

9. Lippman, S.A., "Countable State Continuous Time Dynamic Programming
with Structure," Discussion Paper No. 42, Operations Research Study
Center, Graduate School of Management, UCLA, December 1973.

10. Miller, B.L., "Finite State Continuous Time Markov Decision Processes
with Applications to a Class of Optimization Problems in Queueing
Theory," Technical Report No. 15, Department of Operations Research,
Stanford University, March 10, 1967.




11. Miller, B.L., "Finite State Continuous Time Markov Decision Processes
with a Finite Planning Horizon," SIAM J. on Control, 6, 266-280 (1968).

12. Miller, B.L., "Finite State Continuous Time Markov Decision Processes
with an Infinite Planning Horizon," J. Math. Anal. and Appl., 22,
552-569 (1968).

13. Miller, B.L., "A Queueing Reward System with Several Customer Classes,"
Management Science, 16, 234-245 (1969).

14. Prabhu, N. and S. Stidham, Jr., "Optimal Control of Queueing Systems,"
Technical Report No. 186, Department of Operations Research, Cornell
University, 1973.

15. Sabeti, H., "Optimal Decision in Queueing," Technical Report No. 12,
Operations Research Center, University of California, Berkeley,
April 1970.

16. Veinott, A.F., Jr., "Discrete Dynamic Programming with Sensitive
Discount Optimality Criteria," Ann. Math. Stat., 40, 1635-1660 (1969).


